* Cell, DNA, RNA, protein
* Sequencing methods
* Arising questions for handling the data and making sense of it
* Next two weeks of lectures: sequence alignment and genome assembly


Cells 


Fundamental working units of every living system. 
Every organism is composed of one of two radically different types of cells: 


— prokaryotic cells 
— eukaryotic cells which have DNA inside a nucleus. 


Prokaryotes and eukaryotes are descended from primitive cells and are the result of
3.5 billion years of evolution.

Prokaryotes and Eukaryotes


According to the most recent 
evidence, there are three 
main branches to the tree of 
life 


Prokaryotes include Archaea 
(“ancient ones”) and bacteria 


Eukaryotes form the domain
Eukarya, which includes plants,
animals, fungi and certain
algae


Lecture: Phylogenetic trees covers this topic in more detail

* Born, eat, replicate, and die

Common features of organisms


Chemical energy is stored in ATP 
Genetic information is encoded by DNA 
Information is transcribed into RNA 
There is a common triplet genetic code 
— some variations are known, however 
Translation into proteins involves ribosomes 


Shared metabolic pathways 


Similar proteins among diverse groups of organisms 


All Life depends on 3 critical molecules 


* DNAs (deoxyribonucleic acid)
  - Hold information on how the cell works

* RNAs (ribonucleic acid)
  - Act to transfer short pieces of information to different parts of the cell
  - Provide templates to synthesize into protein

* Proteins
  - Form enzymes that send signals to other cells and regulate gene activity
  - Form the body's major components

DNA structure


DNA has a double helix structure, which is composed of
- a sugar molecule
- a phosphate group
- and a base (A, C, G, T)

DNA strings are written in the direction of transcription: from the 5' end to the 3' end

5' ATTTAGGCC 3'
3' TAAATCCGG 5'
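To make the strand pairing and the 5'-to-3' reading convention concrete, here is a minimal Python sketch that derives the second strand of the example from the first. The names `COMPLEMENT` and `reverse_complement` are illustrative choices, not part of the lecture.

```python
# Watson-Crick base pairing: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, written in its own 5'->3' direction."""
    return dna.translate(COMPLEMENT)[::-1]


top = "ATTTAGGCC"                          # 5' ATTTAGGCC 3'
bottom_3_to_5 = top.translate(COMPLEMENT)  # paired strand, written 3'->5'
bottom_5_to_3 = reverse_complement(top)    # same strand, written 5'->3'

print(bottom_3_to_5)
print(bottom_5_to_3)
```

Complementing base by base gives the lower strand as it is drawn under the upper one; reversing it as well gives the same strand in the conventional 5'-to-3' orientation.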


DNA is contained in chromosomes 


[Figure: chromatin structures, from DNA with core histones added, through isolated patches and the active chromosome during interphase, to the metaphase chromosome during cell division]

In eukaryotes, DNA is packed into linear chromosomes

In prokaryotes, DNA is usually contained in a single, circular chromosome

http://en.wikipedia.org/wiki/Image:Chromatin_Structures.png


Human chromosomes 


Somatic cells (cells in all tissues
except the germline) in humans
have 22 pairs of chromosomes
+ XX (female) or XY (male) = total
of 46 chromosomes


Germline cells have 22
chromosomes + either X or Y =
total of 23 chromosomes

Karyogram of human male using Giemsa staining
(http://en.wikipedia.org/wiki/Karyotype) 


RNA 


* RNA is similar to DNA chemically. It is usually only a single strand. T(hymine) is replaced by U(racil)

* Several types of RNA exist for different functions in the cell.

tRNA linear and 3D view (showing the anticodon): http://www.cgl.ucsf.edu/home/glasfeld/tutorial/trna/trna.gif

DNA, RNA, and the Flow of Information


The central dogma

Replication: DNA can replicate.

Information coded in the sequence of base pairs in DNA is passed to molecules of RNA.

Translation

[...] from proteins to nucleic acids.

Is this true?


Denis Noble: The principles of Systems Biology illustrated using the virtual heart
http://velblod.videolectures.net/2007/pascal/eccs07_dresden/noble_denis/eccs07_noble_psb_01.ppt


Proteins 


Proteins are polypeptides (strings 
of amino acid residues) 
Represented using strings of 
letters from an alphabet of 20: 
AEGLV..WKKLAG 

Typical length 50...1000 residues 


Urease enzyme from Helicobacter pylori 
http://upload.wikimedia.org/wikipedia/commons/c/c5/Amino_acids_2.png

How does DNA/RNA code for protein?


DNA alphabet contains four 
letters but must specify protein, 
or polypeptide sequence of 20 
letters. 

Trinucleotides (triplets) allow 4^3 =
64 possible trinucleotides


Triplets are also called codons 
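The counting argument can be checked directly: the sketch below enumerates all 4^3 triplets and maps them to amino acids using the standard genetic code, written as a 64-letter string in T, C, A, G order. The helper name `translate` and the example sequence are illustrative assumptions; codons are given on the DNA coding strand (T instead of U), and the variant codes mentioned earlier are not covered.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map every triplet (codon) to its amino acid.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding-strand DNA string codon by codon until a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(len(CODON_TABLE))                       # number of codons
print(len(set(CODON_TABLE.values()) - {"*"}))  # number of distinct amino acids
print(translate("ATGGCCTAA"))
```

Because 64 codons encode only 20 amino acids plus stop signals, most amino acids are specified by more than one codon.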

Proteins


20 different amino acids 


— different chemical properties cause the protein chains to fold up into specific 
three-dimensional structures that define their particular functions in the cell. 


Proteins do all essential work for the cell 
— build cellular structures 
— digest nutrients 
— execute metabolic functions 


— mediate information flow within a cell and among cellular communities. 


Proteins work together with other proteins or nucleic acids as "molecular 
machines" 


— structures that fit together and function in highly specific, lock-and-key ways. 


Genes 


“A gene is a union of genomic sequences encoding a coherent set of 
potentially overlapping functional products" 


A DNA segment whose information is expressed either as an RNA 
molecule or protein 

Genes & alleles


* A gene can have different variants


* The variants of the same gene are called 
alleles 


Genes can be found on both strands 


Exons and introns & splicing 


[Figure: exons and introns along a gene]

Introns are removed from RNA after transcription


Exons are joined


This process is called splicing 


Alternative splicing 


Different splice variants may be generated 


DNA and continuum of life... 


* Prokaryotes are typically haploid: they have a single (circular) chromosome
* DNA is usually inherited vertically (parent to daughter)
* Inheritance is clonal
  - Descendants are faithful copies of an ancestral DNA
  - [...] mutations, transposable [...]

Chromosome map of S. dysenteriae, the nine rings describe different properties of the genome
http://www.mgc.ac.cn/ShiBASE/circular_Sd197.htm

Biological string manipulation


* Point mutation: substitution of a base
  - ...ACGGCT... => ...ACGCCT...
* Deletion: removal of one or more contiguous bases (substring)
  - ...TTGATCA... => ...TTTCA...
* Insertion: insertion of a substring
  - ...GGCTAG... => ...GGTCAACTAG...
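The three edit operations can be written as plain string functions. The following is a minimal sketch; the function names and the 0-based position arguments are assumptions made for illustration.

```python
def substitute(seq: str, pos: int, base: str) -> str:
    """Point mutation: replace the base at 0-based position pos."""
    return seq[:pos] + base + seq[pos + 1:]


def delete(seq: str, pos: int, length: int) -> str:
    """Deletion: remove `length` contiguous bases starting at pos."""
    return seq[:pos] + seq[pos + length:]


def insert(seq: str, pos: int, sub: str) -> str:
    """Insertion: insert the substring `sub` before position pos."""
    return seq[:pos] + sub + seq[pos:]


print(substitute("ACGGCT", 3, "C"))   # ACGCCT
print(delete("TTGATCA", 2, 2))        # TTTCA
print(insert("GGCTAG", 2, "TCAA"))    # GGTCAACTAG
```

Sequence alignment can be seen as the inverse task: given two strings, find a plausible series of such operations that turns one into the other.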


Lecture: Sequence alignment 
Lecture: Genome rearrangements 


Genome sequencing & assembly 


* DNA sequencing
  - How do we obtain DNA sequence information from organisms?

* Genome assembly
  - What is needed to put together DNA sequence information from sequencing?

* First statement of sequence assembly problem:
  - Peltola, Söderlund, Tarhio, Ukkonen: Algorithms for some string matching problems arising in molecular genetics. Proc. 9th IFIP World Computer Congress, 1983


Recovery of shredded newspaper 


DNA sequencing 


* DNA sequencing: resolving a nucleotide sequence (whole-genome or less)
* Many different methods developed
  - Maxam-Gilbert method (1977)
  - Sanger method (1977)
  - High-throughput methods, "next-generation" methods

Sanger sequencing: sequencing by synthesis


A sequencing technique developed by Frederick Sanger in 1977
Also called dideoxy sequencing
A DNA polymerase is an enzyme that catalyzes DNA synthesis
DNA polymerase needs a primer
Synthesis always proceeds in the 5'->3' direction
In Sanger sequencing, chain-terminating dideoxynucleoside triphosphates (ddXTPs) are employed
- ddATP, ddCTP, ddGTP, ddTTP lack the 3'-OH group of dXTPs
A mixture of dXTPs with a small amount of ddXTPs is given to DNA polymerase with DNA template and primer
ddXTPs are given fluorescent labels
When DNA polymerase incorporates a ddXTP, the synthesis cannot proceed
The process yields copied sequences of different lengths
Each sequence is terminated by a labeled ddXTP
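The chain-termination idea can be illustrated with a toy simulation. This is only a sketch under strong assumptions: every molecule copies the same strand, each position has the same small probability `dd_fraction` of receiving a labeled ddXTP, and there are no errors. The function names, the example strand and the parameter values are all hypothetical.

```python
import random


def sanger_fragments(copy_strand, n_molecules=2000, dd_fraction=0.05, seed=0):
    """Simulate chain termination; return (length, label) for each terminated copy."""
    rng = random.Random(seed)
    fragments = []
    for _ in range(n_molecules):
        for i, base in enumerate(copy_strand):
            if rng.random() < dd_fraction:        # a labeled ddXTP was incorporated
                fragments.append((i + 1, base))   # fragment length and its end label
                break                             # synthesis cannot proceed
    return fragments


def read_off(fragments):
    """Sort fragments by length and read the terminal labels in that order."""
    label_by_length = {}
    for length, label in fragments:
        label_by_length[length] = label
    return "".join(label_by_length[n] for n in sorted(label_by_length))


fragments = sanger_fragments("GATTACAGGCATTCG")
print(read_off(fragments))
```

Because each fragment length corresponds to exactly one position, ordering the fragments by length and reading their labels recovers the copied sequence, provided every length is observed at least once.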

Determining the sequence


Sequences are sorted according to length by capillary electrophoresis

Fluorescent signals corresponding to labels are registered

Base calling: identifying which base corresponds to each position in a read
- Non-trivial problem!

Output sequences from base calling are called reads


Reads are short! 


* Modern Sanger sequencers can produce quality reads up to ~750 bases!
  - Instruments provide you with a quality file for bases in reads, in addition to actual sequence data

* Compare the read length against the size of the human genome (2.9x10^9 bases)

* Reads have to be assembled!


Problems 


Sanger sequencing error rate per base varies from 1% to 3%!
Repeats in DNA
- For example, the ~300-base-long Alu sequence is repeated over a million times in the human genome
- Repeats occur at different scales
What happens if repeat length is longer than read length?


Shortest superstring problem 
— Find the shortest string that "explains" the reads 
— Given a set of strings (reads), find a shortest string that contains all of them 
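A common way to approach the shortest superstring problem is a greedy heuristic: repeatedly merge the two strings with the largest suffix-prefix overlap. The sketch below illustrates this; it is an approximation that is not guaranteed to return a shortest superstring, it ignores sequencing errors and reverse complements, and the function names are illustrative.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(reads):
    """Greedy heuristic: keep merging the pair of strings with the largest overlap."""
    # Drop duplicates and reads fully contained in another read.
    reads = sorted(r for r in set(reads)
                   if not any(r != s and r in s for s in reads))
    while len(reads) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j and overlap(a, b) > best_k:
                    best_k, best_i, best_j = overlap(a, b), i, j
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads)
                 if idx not in (best_i, best_j)] + [merged]
    return reads[0] if reads else ""


reads = ["GCATCGTG", "CATCGTGA", "ATCGTGAT"]
print(greedy_superstring(reads))
```

The result always contains every input read, but when repeats are present the shortest explanation of the reads need not be the true genome, which is one reason real assemblers do not rely on this formulation alone.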


Sequence assembly and combination locks 


* What do sequence assembly and opening keypad locks have in common?


Whole-genome shotgun sequencing


Whole-genome shotgun sequence assembly starts with a large sample of genomic DNA

1. Sample is randomly partitioned into inserts of length > 500 bases
2. Inserts are multiplied by cloning them into a vector which is used to infect bacteria
3. DNA is collected from bacteria and sequenced
4. Reads are assembled


Assembly of reads with Overlap-Layout-Consensus algorithm

Overlap
- Finding potentially overlapping reads
Layout
- Finding the order of reads along DNA
Consensus (Multiple alignment)
- Deriving the DNA sequence from the layout

Next, the method is described at a very abstract level, skipping a lot of details

Finding overlaps


First, pairwise overlap alignment of reads is resolved

Reads can be from either DNA strand: the reverse complement r* of each read r has to be considered

acggagtcc
    agtccgcgctt

r1: tgagt, r1*: actca
r2: tccac, r2*: gtgga
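Since the strand of origin is unknown, an overlap search has to try a read both as given and as its reverse complement. A minimal sketch with exact (error-free) suffix-prefix matching follows; the helper names and the labels "b" and "b*" are illustrative assumptions.

```python
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(read: str) -> str:
    """Return r*, the reverse complement of read r."""
    return read.translate(COMPLEMENT)[::-1]


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def best_orientation(a: str, b: str):
    """Compare the overlap of a with b and with b*; return (label, overlap)."""
    candidates = {"b": overlap(a, b), "b*": overlap(a, reverse_complement(b))}
    label = max(candidates, key=candidates.get)
    return label, candidates[label]


print(overlap("acggagtcc", "agtccgcgctt"))
print(best_orientation("acggagtcc", reverse_complement("agtccgcgctt")))
```

In the second call the read arrives from the opposite strand, so the overlap is only found after taking its reverse complement.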


Example sequence to assemble 


5' - CAGCGCGCTGCGTGACGAGTCTGACAAAGACGGTATGCGCATCG
     TGATTGAAGTGAAACGCGATGCGGTCGGTCGGTGAAGTTGTGCT - 3'


* 20 reads (Read* is the reverse complement of Read):


 #  Read      Read*      #  Read      Read*
 1  CATCGTGA  TCACGATG  11  GGTCGGTG  CACCGACC
 2  CGGTGAAG  CTTCACCG  12  ATCGTGAT  ATCACGAT
 3  TATGCGCA  TGCGCATA  13  GCGCTGCG  CGCAGCGC
 4  GACGAGTC  GACTCGTC  14  GCATCGTG  CACGATGC
 5  CTGACAAA  TTTGTCAG  15  AGCGCGCT  AGCGCGCT
 6  ATGCGCAT  ATGCGCAT  16  GAAGTTGT  ACAACTTC
 7  ATGCGGTC  GACCGCAT  17  AGTGAAAC  GTTTCACT
 8  CTGCGTGA  TCACGCAG  18  ACGCGATG  CATCGCGT
 9  GCGTGACG  CGTCACGC  19  GCGCATCG  CGATGCGC
10  GTCGGTGA  TCACCGAC  20  AAGTGAAA  TTTCACTT

Finding overlaps


* Overlap between two reads can be found with a dynamic programming algorithm
  - Errors can be taken into account

* Dynamic programming will be discussed more during the next two weeks

* Overlap scores are stored into the overlap matrix
  - Entries (i, j) below the diagonal denote overlap of read r_i and r_j*

Example: Overlap(1, 6) = 3
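One way such a dynamic programming overlap computation can look is sketched below: a suffix of read a is aligned against a prefix of read b, so starting anywhere in a is free, while the alignment must begin at the start of b and run to the end of a. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions for illustration, not the scheme used in the lecture.

```python
def overlap_score(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    """Best score of aligning some suffix of a with some prefix of b."""
    # Row 0: no characters of a used yet, so a prefix of b costs only gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)      # column 0 stays 0: free start anywhere in a
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + s,   # align a[i-1] with b[j-1]
                         prev[j] + gap,     # a[i-1] against a gap
                         cur[j - 1] + gap)  # b[j-1] against a gap
        prev = cur
    # The alignment must reach the end of a but may stop at any prefix of b.
    return max(prev)


print(overlap_score("acggagtcc", "agtccgcgctt"))   # exact overlap "agtcc"
print(overlap_score("acggagtcc", "agaccgcgctt"))   # same overlap with one mismatch
```

With error-free reads the score equals the exact overlap length; a sequencing error lowers the score instead of hiding the overlap completely.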


Finding layout & consensus 


The method extends the assembly greedily by choosing the best overlaps

Both orientations are considered

Sequence is extended as far as possible

14  GCATCGTG
 1   CATCGTGA
12    ATCGTGAT
    GCATCGTGAT   consensus sequence

Positions where the aligned reads disagree give ambiguous bases in the consensus
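Given a layout, a simple way to derive a consensus is a column-wise majority vote. The sketch below assumes the layout is already known as (offset, read) pairs without gaps, and marks ties or uncovered columns with N; both the representation and the function name are illustrative assumptions.

```python
from collections import Counter


def consensus(layout):
    """layout: list of (offset, read). Majority vote per column, 'N' if ambiguous."""
    length = max(offset + len(read) for offset, read in layout)
    columns = [Counter() for _ in range(length)]
    for offset, read in layout:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    result = []
    for column in columns:
        ranked = column.most_common(2)
        if not ranked or (len(ranked) == 2 and ranked[0][1] == ranked[1][1]):
            result.append("N")          # no coverage or a tie: ambiguous base
        else:
            result.append(ranked[0][0])
    return "".join(result)


layout = [(0, "GCATCGTG"), (1, "CATCGTGA"), (2, "ATCGTGAT")]
print(consensus(layout))

# One read carries a sequencing error (G -> C); the other two reads outvote it.
noisy = [(0, "GCATCGTG"), (1, "CATCCTGA"), (2, "ATCGTGAT")]
print(consensus(noisy))
```

Higher coverage makes the vote more reliable, which is why consensus accuracy is usually much better than the per-base accuracy of single reads.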


Finding layout & consensus 


We move on to next best overlaps and extend the sequence from there

The method stops when there are no more overlaps to consider

A number of contigs is produced

Contig stands for contiguous sequence, resulting from merging reads

 2         CGGTGAAG
10       GTCGGTGA
11      GGTCGGTG
 7  ATGCGGTC
    ATGCGGTCGGTGAAG   (contig)


Whole-genome shotgun sequencing: summary


[Figure: original genome sequence covered by reads; overlapping reads merge into a contig, while a non-overlapping read stays separate]


Ordering of the reads is initially unknown
Overlaps resolved by aligning the reads


In a 3x10^9 bp genome with 500 bp reads and 5x coverage, there are ~10^7 reads and
~10^7(10^7-1)/2 = ~5x10^13 pairwise sequence comparisons
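The estimate can be reproduced with a few lines of arithmetic. The sketch below assumes the usual relation coverage = (number of reads x read length) / genome length; the function name is illustrative. With the stated parameters the read count comes out as 3x10^7, which the slide rounds to the order of magnitude 10^7.

```python
def shotgun_stats(genome_bp: int, read_bp: int, coverage: int):
    """Return (number of reads, number of unordered read pairs)."""
    n_reads = genome_bp * coverage // read_bp
    n_pairs = n_reads * (n_reads - 1) // 2
    return n_reads, n_pairs


n_reads, n_pairs = shotgun_stats(3 * 10**9, 500, 5)
print(f"{n_reads:.1e} reads, {n_pairs:.1e} pairwise comparisons")

# The order-of-magnitude figure used on the slide:
rounded_pairs = 10**7 * (10**7 - 1) // 2
print(f"{rounded_pairs:.1e}")
```

Either way the number of comparisons grows quadratically with the number of reads, so comparing all pairs naively is impractical for large genomes.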


Repeats in DNA and genome assembly 


[Figure: two instances of the same repeat, rpt1A and rpt1B]

Figure 2. Repeat sequence. The top represents the correct layout of three DNA sequences. The bottom shows a repeat collapsed in a misassembly.

Repeats in DNA cause problems in sequence assembly


* Recap: if repeat length exceeds read length, we might not get the correct assembly

* This is a problem especially in eukaryotes
  - ~3.1% of genome consists of repeats in Drosophila, ~45% in human

* Possible solutions
  1. Increase read length - feasible?
  2. Divide genome into smaller parts, with known order, and sequence parts individually

[Figure: "divide and conquer" sequencing approaches (BAC-by-BAC) compared with whole-genome shotgun sequencing]

BAC-by-BAC sequencing


* Each BAC (Bacterial Artificial Chromosome) is about 150 kbp
* Covering the human genome requires ~30000 BACs
* BACs are shotgun-sequenced separately
  - Number of repeats in each BAC is significantly smaller than in the whole genome...
  - ...but this needs much more manual work compared to whole-genome shotgun sequencing

Hybrid method


* Divide-and-conquer and whole-genome shotgun approaches can be combined
  - Obtain high coverage from whole-genome shotgun sequencing for short contigs
  - Generate a set of BAC contigs with low coverage
  - Use BAC contigs to "bin" short contigs to correct places
* This approach was used to sequence the brown Norway rat genome in 2004


First whole-genome shotgun sequencing 
project: Drosophila melanogaster 


* Fruit fly is a common model organism in biological studies

* Whole-genome assembly reported in Eugene Myers, et al., A Whole-Genome Assembly of Drosophila, Science 24, 2000

* Genome size 120 Mbp

Sequencing of the Human Genome


The (draft) human genome was published in 2001

Two efforts:
- Human Genome Project (public consortium)
- Celera (private company)

HGP: BAC-by-BAC approach
Celera: whole-genome shotgun sequencing

HGP: Nature 15 February 2001, Vol 409 Number 6822
Celera: Science 16 February 2001, Vol 291, Issue 5507




Next-gen sequencing: 454 


* Sanger sequencing is the prominent first-generation sequencing method
* Many new sequencing methods are emerging

* Genome Sequencer FLX (454 Life Science / Roche)
  - >100 Mb / 7.5 h run
  - Read length 250-300 bp
  - >99.5% accuracy / base in a single run
  - ≥99.99% accuracy / base in consensus


DNA library preparation

[Figure: ligation of adaptors, then selection (isolate A/B fragments only)]

* Genome fragmented by nebulization
* No cloning; no colony picking
* sstDNA library created with adaptors
* A/B fragments selected by purification

Emulsion PCR


1. Anneal sstDNA to an excess of DNA capture beads
2. Emulsify beads and PCR reagents in water-in-oil microreactors
3. Clonal amplification occurs inside microreactors
4. Break microreactors and enrich for DNA-positive beads

sstDNA library -> Bead-amplified sstDNA library

* Well diameter: average of 44 µm
* 400,000 reads obtained in parallel
* A single cloned amplified sstDNA bead is deposited per well

Amplified sstDNA library beads -> Quality filtered bases


Mardis ER. 2008. Annu. Rev. Genomics Hum. Genet. 9:387-402


The method used by the Roche/454 sequencer 
to amplify single-stranded DNA copies from a 
fragment library on agarose beads. 

A mixture of DNA fragments with agarose beads 
containing complementary oligonucleotides to the 
adapters at the fragment ends are mixed in an 
approximately 1:1 ratio. 

The mixture is encapsulated by vigorous 
vortexing into aqueous micelles that contain PCR 
reactants surrounded by oil, and pipetted into a 
96-well microtiter plate for PCR amplification. 
The resulting beads are decorated with 
approximately 1 million copies of the original 
single-stranded fragment, which provides 
sufficient signal strength during the 
pyrosequencing reaction that follows to detect 
and record nucleotide incorporation events. 


sstDNA, single-stranded template DNA. 


Next-gen sequencing: Illumina Solexa 


* Illumina / Solexa Genome Analyzer
  - Read length 35-50 bp
  - 1-2 Gb / 3-6 day run
  - >98.5% accuracy / base in a single run
  - 99.99% accuracy / consensus with 3x coverage

Sequence read over multiple chemistry cycles


Repeat cycles of sequencing to determine the sequence 
of bases in a given fragment a single base at a time. 


Mardis ER. 2008. Annu. Rev. Genomics Hum. Genet. 9:387-402


The Illumina sequencing-by-synthesis 
approach. Cluster strands created by bridge 
amplification are primed and all four 
fluorescently labeled, 3’-OH blocked 
nucleotides are added to the flow cell with DNA 
polymerase. The cluster strands are extended 
by one nucleotide. Following the incorporation 
step, the unused nucleotides and DNA 
polymerase molecules are washed away, a 
scan buffer is added to the flow cell, and the 
optics system scans each lane of the flow cell 
by imaging units called tiles. Once imaging is 
completed, chemicals that effect cleavage of 
the fluorescent labels and the 3'-OH blocking 
groups are added to the flow cell, which 
prepares the cluster strands for another round
of fluorescent nucleotide incorporation.


Next-gen sequencing: SOLiD


* SOLiD
  - Read length 25-30 bp
  - 1-2 Gb / 5-10 day run
  - >99.94% accuracy / base
  - ≥99.999% accuracy / consensus with 15x coverage

[Figure: SOLiD ligation cycle; step 5: repeat steps 1-4 to extend sequence]

Mardis ER. 2008. Annu. Rev. Genomics Hum. Genet. 9:387-402


(b) Data collection and image analysis

[Figure: two-base encoding. Each fluorescent color identifies one group of two-base combinations:]
  AT CG GC TA
  AC CA GT TG
  AA CC GG TT
  GA TC AG CT
The ligase-mediated sequencing approach of the Applied Biosystems SOLiD 
sequencer. In a manner similar to Roche/454 emulsion PCR amplification, DNA 
fragments for SOLiD sequencing are amplified on the surfaces of 1-µm magnetic
beads to provide sufficient signal during the sequencing reactions, and are then 
deposited onto a flow cell slide. Ligase-mediated sequencing begins by annealing 
a primer to the shared adapter sequences on each amplified fragment, and then 
DNA ligase is provided along with specific fluorescent-labeled 8mers, whose 4th 
and 5th bases are encoded by the attached fluorescent group. Each ligation step 
is followed by fluorescence detection, after which a regeneration step removes 
bases from the ligated 8mer (including the fluorescent group) and concomitantly 
prepares the extended primer for another round of ligation. (b) Principles of two- 
base encoding. Because each fluorescent group on a ligated 8mer identifies a 
two-base combination, the resulting sequence reads can be screened for base- 
calling errors versus true polymorphisms versus single base deletions by aligning 
the individual reads to a known high-quality reference sequence. 
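Two-base encoding can be sketched compactly: if the bases are numbered A=0, C=1, G=2, T=3, the XOR of two adjacent bases yields one of four values, and the resulting groups match those in the figure (for example AA, CC, GG and TT share one color). The numeric color labels 0-3 and the function names are assumptions for illustration; the instrument reports fluorescent colors, not numbers.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def encode(seq: str):
    """Return the first base plus one color (0-3) per pair of adjacent bases."""
    colors = [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0], colors


def decode(first: str, colors) -> str:
    """Rebuild the base sequence from the first base and the color calls."""
    seq = [first]
    for color in colors:
        seq.append(BASES[CODE[seq[-1]] ^ color])
    return "".join(seq)


first, colors = encode("ATGGA")
print(colors)                        # [3, 1, 0, 2]
print(decode(first, colors))         # ATGGA

print(encode("ATCGA")[1])            # a true single-base change alters two adjacent colors
print(decode("A", [3, 0, 0, 2]))     # one wrong color call changes every later base
```

This difference is what makes it possible to separate base-calling errors from true polymorphisms when reads are compared with a reference: an isolated color mismatch points to a measurement error, while a real single-base variant shows up as two adjacent color changes.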


